A hybrid approach for urdu sentence boundary disambiguation
نویسندگان
چکیده
Sentence boundary identification is a preliminary step for preparing a text document for Natural Language Processing tasks, e.g., machine translation, POS tagging, text summarization and etc. We present a hybrid approach for Urdu sentence boundary disambiguation comprising of unigram statistical model and rule based algorithm. After implementing this approach, we obtained 99.48% precision, 86.35% recall and 92.45% F1-Measure while keeping training and testing data different from each other, and with same training and testing data, we obtained 99.36% precision, 96.45% recall and 97.89% F1-Measure.
منابع مشابه
Challenges in Urdu Text Tokenization and Sentence Boundary Disambiguation
Urdu is morphologically rich language with different nature of its characters. Urdu text tokenization and sentence boundary disambiguation is difficult as compared to the language like English. Major hurdle for tokenization is improper use of space between words, where as absence of case discrimination makes the sentence boundary detection a difficult task. In this paper some issues regarding b...
متن کاملAn artificial neural network approach for sentence boundary disambiguation in urdu language text
Sentence boundary identification is an important step for text processing tasks, e.g., machine translation, POS tagging, text summarization etc., in this paper, we present an approach comprising of Feed Forward Neural Network (FFNN) along with part of speech information of the words in a corpus. Proposed adaptive system has been tested after training it with varying sizes of data and threshold ...
متن کاملPii: S0306-4573(01)00044-9
Most work in NLP requires that texts have been previously segmented into sentences and words. Segmenting a text into sentences and words, however, is a complex task, due to the ambiguity of many punctuation marks and spaces. Furthermore, Web texts such as HTML documents are more difficult to make into well refined and segmented texts because they are described in a more free style, with many se...
متن کاملCompound Sentence Segmentation and Sentence Boundary Detection in Urdu
The raw Urdu corpus comprises of irregular and large sentences which need to be properly segmented in order to make them useful in Natural Language Engineering (NLE). This makes the Compound Sentences Segmentation (CSS) timely and vital research topic. The existing online text processing tools are developed mostly for computationally developed languages such as English, Japanese and Spanish etc...
متن کاملPeriods, Capitalized Words, etc
In this article we present an approach for tackling three important aspects of text normalization: sentence boundary disambiguation, disambiguation of capitalized words in positions where capitalization is expected, and identification of abbreviations. As opposed to the two dominant techniques of computing statistics or writing specialized grammars, our document-centered approach works by consi...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- Int. Arab J. Inf. Technol.
دوره 9 شماره
صفحات -
تاریخ انتشار 2012